254 research outputs found

    Minimum error correction-based haplotype assembly: considerations for long read data

    The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems. Comment: 17 pages, 6 figures
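The MEC metric discussed in this abstract can be stated compactly: given a set of aligned reads (each a partial row over SNP sites) and a candidate haplotype pair, MEC counts the minimum number of base corrections needed so that every read is consistent with one of the two haplotypes. A minimal sketch, not the paper's implementation; the read matrix and alleles below are illustrative:

```python
# Minimal MEC sketch: reads are strings over SNP sites ('0'/'1' alleles,
# '-' = site not covered by the read). Each read is charged the Hamming
# distance to the CLOSER of the two candidate haplotypes; MEC is the sum.
def mec_score(reads, h1, h2):
    total = 0
    for read in reads:
        d1 = sum(1 for r, a in zip(read, h1) if r != '-' and r != a)
        d2 = sum(1 for r, a in zip(read, h2) if r != '-' and r != a)
        total += min(d1, d2)
    return total

# Toy fragment matrix over 5 SNP sites and a complementary haplotype pair.
reads = ["01--1", "-110-", "010-1", "10-0-"]
print(mec_score(reads, "01011", "10100"))
```

The paper's point is that, with error-prone long reads, an incorrect haplotype pair can attain a lower `mec_score` than the true one, so minimizing MEC alone is not guaranteed to recover the exact haplotypes.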

    In silico assessment of a novel single-molecule protein fingerprinting method employing fragmentation and nanopore detection

    Summary: The identification of proteins at the single-molecule level would open exciting new avenues in biological research and disease diagnostics. Previously, we proposed a nanopore-based method for protein identification called chop-n-drop fingerprinting, in which the fragmentation pattern induced and measured by a proteasome-nanopore construct is used to identify single proteins. In the simulation study presented here, we show that 97.1% of the human proteome's constituents are uniquely identified under close to ideal measuring circumstances, using a simple alignment-based classification method. We show that our method is robust against experimental error, as 69.4% can still be identified when the resolution is half of what is currently attainable and 10% of proteasome restriction sites and protein fragments are randomly ignored. Based on these results and our experimental proof of concept, we argue that chop-n-drop fingerprinting has the potential to make cost-effective single-molecule protein identification feasible in the near future.
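The "simple alignment-based classification" the abstract mentions can be illustrated with a hedged sketch (not the authors' code): match an observed fragment-size pattern against reference fingerprints within a measurement tolerance, and report the protein with the most matched fragments. The fingerprint dictionary, sizes, and tolerance below are invented for illustration:

```python
# Greedy one-to-one matching of sorted fragment sizes within a tolerance.
def match_count(observed, reference, tol):
    matched, j = 0, 0
    for o in sorted(observed):
        # skip reference fragments that are already too small to match
        while j < len(reference) and reference[j] < o - tol:
            j += 1
        if j < len(reference) and abs(reference[j] - o) <= tol:
            matched += 1
            j += 1
    return matched

def identify(observed, fingerprints, tol=0.5):
    # fingerprints: protein name -> expected fragment sizes (hypothetical units)
    return max(fingerprints,
               key=lambda p: match_count(observed, sorted(fingerprints[p]), tol))

fingerprints = {"P1": [2.1, 5.4, 9.8], "P2": [3.0, 5.5, 7.2]}
print(identify([2.0, 5.6, 9.9], fingerprints))
```

The robustness result in the abstract corresponds to widening `tol` and randomly deleting reference fragments before matching.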

    Caretta – A multiple protein structure alignment and feature extraction suite

    The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to use not only sequence features but also structure features as input for machine learning. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.

    Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data

    Background: Identification of biological specimens is a major requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances. Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at species level. Discussion: Cnidaria allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design). Comment: 47 pages, 13 figures
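The reference-free idea behind tools of this kind can be sketched in a few lines: decompose each sample into k-mers and compare samples by set similarity, with no alignment or reference genome required. This is a hedged illustration of the general k-mer approach, not Cnidaria's actual data structures or parameters:

```python
# Compare sequences by the Jaccard similarity of their k-mer sets.
def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Toy sequences: s1 and s2 share a common prefix, s3 is unrelated.
s1 = "ACGTACGTGG"
s2 = "ACGTACGTCC"
s3 = "TTTTGGGGCC"
print(jaccard(kmers(s1), kmers(s2)))
print(jaccard(kmers(s1), kmers(s3)))
```

Clustering then amounts to building a pairwise similarity matrix over all samples and feeding it to a standard clustering algorithm; because only k-mer sets are compared, raw reads and assemblies can be mixed freely.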

    Topology of molecular interaction networks

    Abstract: Molecular interactions are often represented as network models which have become the common language of many areas of biology. Graphs serve as convenient mathematical representations of network models and have themselves become objects of study. Their topology has been intensively researched over the last decade after evidence was found that they share underlying design principles with many other types of networks. Initial studies suggested that molecular interaction network topology is related to biological function and evolution. However, further whole-network analyses did not lead to a unified view on what this relation may look like, with conclusions highly dependent on the type of molecular interactions considered and the metrics used to study them. It is unclear whether global network topology drives function, as suggested by some researchers, or whether it is simply a byproduct of evolution or even an artefact of representing complex molecular interaction networks as graphs. Nevertheless, network biology has progressed significantly over the last years. We review the literature, focusing on two major developments. First, realizing that molecular interaction networks can be naturally decomposed into subsystems (such as modules and pathways), topology is increasingly studied locally rather than globally. Second, there is a move from a descriptive approach to a predictive one: rather than correlating biological network topology to generic properties such as robustness, it is used to predict specific functions or phenotypes. Taken together, this change in focus from globally descriptive to locally predictive points to new avenues of research. In particular, multi-scale approaches are promising developments that may drive the study of molecular interaction networks further.

    Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads

    DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1 population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.

    Genomic prediction in plants: opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]

    Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
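The parametric baseline in comparisons like this, GBLUP, is mathematically equivalent to ridge regression on the marker matrix (SNP-BLUP). A hedged sketch of that baseline on simulated data; the marker matrix, effect sizes, and shrinkage parameter `lam` are illustrative, not from the study:

```python
import numpy as np

def snp_blup(X, y, lam=1.0):
    # Ridge/SNP-BLUP marker effects: solve (X'X + lam*I) b = X'y.
    # X: n x m matrix of marker dosages (0/1/2), y: n phenotypes.
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Simulate a simple additive (linear) trait with small residual noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 20)).astype(float)
true_b = rng.normal(size=20)
y = X @ true_b + rng.normal(scale=0.1, size=50)

b = snp_blup(X, y, lam=0.1)
pred = X @ b
```

For such a purely additive trait this linear baseline fits almost perfectly, which mirrors the abstract's conclusion: tree ensembles mainly pay off when phenotypes are nonlinear in the markers.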

    Shifts in growth strategies reflect tradeoffs in cellular economics

    The growth rate-dependent regulation of cell size, ribosomal content, and metabolic efficiency follows a common pattern in unicellular organisms: with increasing growth rates, cell size and ribosomal content increase and a shift to energetically inefficient metabolism takes place. The latter two phenomena are also observed in fast growing tumour cells and cell lines. These patterns suggest a fundamental principle of design. In biology such designs can often be understood as the result of the optimization of fitness. Here we show that in basic models of self-replicating systems these patterns are the consequence of maximizing the growth rate. Whereas most models of cellular growth consider a part of physiology, for instance only metabolism, the approach presented here integrates several subsystems to a complete self-replicating system. Such models can yield fundamentally different optimal strategies. In particular, it is shown how the shift in metabolic efficiency originates from a tradeoff between investments in enzyme synthesis and metabolic yields for alternative catabolic pathways. The models elucidate how the optimization of growth by natural selection shapes growth strategies.